Assignment by Carlos Sánchez Polo and Jesús Martínez Leal¶

Last edited: 01/03/2024

Table of contents

  • Initial example
    • Load data
    • Create train/test split
    • Run t-SNE
    • Transformation
    • Putting it all together
  • Exercises with openTSNE
    • Load data
    • - Start by running the function with the default values (perplexity = 30, early_exaggeration = 12, initialization='pca') on a training subset of 75% of the sample.
    • - Run the model without early exaggeration (early_exaggeration=1). What differences do you observe and what causes them?
    • - Run the model with the default values but changing the initialization to random. What happens? Do we get better or worse results than before? Also compare the execution times and explain why they differ.
    • - Run the model with two very different perplexity values, e.g. 1 and 100 (all other values at their defaults), and comment on the results.
    • - Of all the t-SNE configurations tried in the previous exercises, pick the one with the best results and project the test data into its embedding. Plot the full dataset.
  • Exercises with sklearn's TSNE
    • Run sklearn's t-SNE on the circles dataset, varying the perplexity (values 5, 30, 100).
    • What KL divergence do you get in each case?
    • Compare the execution times of Barnes-Hut with the exact method. Use the perplexity value that obtained the best result in the previous exercise.

Initial example¶

Download the utils helper file from https://github.com/pavlin-policar/openTSNE/blob/master/examples/utils.py (in this notebook it is imported from a local resources package).

In [ ]:
from openTSNE import TSNE
from resources import utils
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

Load data¶

This example uses the Macosko 2015 dataset, which contains mouse retina data. It is a well-known dataset that has been extensively explored in the literature. It can be obtained from the following link: http://file.biolab.si/opentsne/macosko_2015.pkl.gz

In [ ]:
import gzip
import pickle

with gzip.open("data/macosko_2015.pkl.gz", "rb") as f:
    data = pickle.load(f)

x = data["pca_50"]
y = data["CellType1"].astype(str)

print("Data set contains %d samples with %d features" % x.shape)
Data set contains 44808 samples with 50 features

Create train/test split¶

In [ ]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.33, random_state=42)

print("%d training samples" % x_train.shape[0])
print("%d test samples" % x_test.shape[0])
30021 training samples
14787 test samples

Run t-SNE¶

First, an embedding of the training data is created. The input parameters of the TSNE class are documented at the following link:

https://opentsne.readthedocs.io/en/latest/api/index.html

In [ ]:
tsne = TSNE(
    perplexity=30,
    metric="euclidean",
    n_jobs=-1,
    random_state=42,
    verbose=True,
)
In [ ]:
%time embedding_train = tsne.fit(x_train)

utils.plot(embedding_train, y_train, colors=utils.MACOSKO_COLORS)
--------------------------------------------------------------------------------
TSNE(early_exaggeration=12, n_jobs=-1, random_state=42, verbose=True)
--------------------------------------------------------------------------------
===> Finding 90 nearest neighbors using Annoy approximate search using euclidean distance...
   --> Time elapsed: 2.80 seconds
===> Calculating affinity matrix...
   --> Time elapsed: 0.57 seconds
===> Calculating PCA-based initialization...
   --> Time elapsed: 0.08 seconds
===> Running optimization with exaggeration=12.00, lr=2501.75 for 250 iterations...
Iteration   50, KL divergence 5.1602, 50 iterations in 1.1982 sec
Iteration  100, KL divergence 5.1000, 50 iterations in 1.2712 sec
Iteration  150, KL divergence 5.0648, 50 iterations in 1.2973 sec
Iteration  200, KL divergence 5.0503, 50 iterations in 1.3088 sec
Iteration  250, KL divergence 5.0422, 50 iterations in 1.3080 sec
   --> Time elapsed: 6.38 seconds
===> Running optimization with exaggeration=1.00, lr=30021.00 for 500 iterations...
Iteration   50, KL divergence 3.0021, 50 iterations in 1.3138 sec
Iteration  100, KL divergence 2.7919, 50 iterations in 2.3479 sec
Iteration  150, KL divergence 2.6944, 50 iterations in 3.5985 sec
Iteration  200, KL divergence 2.6360, 50 iterations in 4.9340 sec
Iteration  250, KL divergence 2.5954, 50 iterations in 5.9422 sec
Iteration  300, KL divergence 2.5646, 50 iterations in 7.0011 sec
Iteration  350, KL divergence 2.5405, 50 iterations in 7.9974 sec
Iteration  400, KL divergence 2.5218, 50 iterations in 8.5782 sec
Iteration  450, KL divergence 2.5051, 50 iterations in 9.5600 sec
Iteration  500, KL divergence 2.4925, 50 iterations in 10.5547 sec
   --> Time elapsed: 61.83 seconds
CPU times: total: 13min 5s
Wall time: 1min 11s

Transformation¶

At the time of writing, openTSNE is the only t-SNE library that allows new points to be added to an existing embedding.

In [ ]:
%time embedding_test = embedding_train.transform(x_test)

utils.plot(embedding_test, y_test, colors=utils.MACOSKO_COLORS)
===> Finding 15 nearest neighbors in existing embedding using Annoy approximate search...
   --> Time elapsed: 0.76 seconds
===> Calculating affinity matrix...
   --> Time elapsed: 0.05 seconds
===> Running optimization with exaggeration=4.00, lr=0.10 for 0 iterations...
   --> Time elapsed: 0.00 seconds
===> Running optimization with exaggeration=1.50, lr=0.10 for 250 iterations...
Iteration   50, KL divergence 213893.9378, 50 iterations in 0.2370 sec
Iteration  100, KL divergence 212358.2086, 50 iterations in 0.2640 sec
Iteration  150, KL divergence 211368.8011, 50 iterations in 0.2532 sec
Iteration  200, KL divergence 210642.9236, 50 iterations in 0.2411 sec
Iteration  250, KL divergence 210092.0278, 50 iterations in 0.2760 sec
   --> Time elapsed: 1.27 seconds
CPU times: total: 18 s
Wall time: 2.45 s

Putting it all together¶

Overlay the transformed test points on the original embedding.

In [ ]:
fig, ax = plt.subplots(figsize=(8, 8))
utils.plot(embedding_train, y_train, colors=utils.MACOSKO_COLORS, alpha=0.25, ax=ax)
utils.plot(embedding_test, y_test, colors=utils.MACOSKO_COLORS, alpha=0.75, ax=ax)

Exercises with openTSNE¶

Apply the t-SNE model to the MNIST dataset.

Load data¶

Load MNIST dataset: https://www.kaggle.com/weiouyang/test-dataset/version/1

In [ ]:
import gzip
import pickle
import sys
import matplotlib.pyplot as plt

f = gzip.open('data/mnist.pkl.gz', 'rb')
if sys.version_info < (3,):
    (X_train, y_train), (X_test, y_test) = pickle.load(f)
else:
    (X_train0, y_train0), (X_test0, y_test0) = pickle.load(f, encoding="bytes")
    
print(X_train0.shape)
print(y_train0.shape)  # fixed: the Python 3 branch defines y_train0, not y_train

for i in range(9):  
    plt.subplot(330 + 1 + i)
    plt.imshow(X_train0[i], cmap=plt.get_cmap('gray'))
(60000, 28, 28)
(60000,)
In [ ]:
X_train0=X_train0.reshape(60000,-1)
y_train0 = y_train0.astype(str)

X_test0 = X_test0.reshape(10000,-1)
y_test0 = y_test0.astype(str)
x_train, x_test, y_train, y_test = train_test_split(X_train0, y_train0, test_size=.25, random_state=42)

- Start by running the function with the default values (perplexity = 30, early_exaggeration = 12, initialization='pca') on a training subset of 75% of the sample.¶

In [ ]:
#X_train0 = X_train0[:10000]
#y_train = y_train[:10000]
#X_test0 = X_test0[:10000]
#y_test = y_test[:10000]
In [ ]:
tsne = TSNE(perplexity = 30, metric = 'euclidean', early_exaggeration = 12,  random_state = 42, n_jobs = -1, verbose = True, initialization = 'pca')

embedding_train_default = tsne.fit(x_train)
--------------------------------------------------------------------------------
TSNE(early_exaggeration=12, n_jobs=-1, random_state=42, verbose=True)
--------------------------------------------------------------------------------
===> Finding 90 nearest neighbors using Annoy approximate search using euclidean distance...
   --> Time elapsed: 14.91 seconds
===> Calculating affinity matrix...
   --> Time elapsed: 1.93 seconds
===> Calculating PCA-based initialization...
   --> Time elapsed: 0.63 seconds
===> Running optimization with exaggeration=12.00, lr=3750.00 for 250 iterations...
Iteration   50, KL divergence 5.6614, 50 iterations in 1.4180 sec
Iteration  100, KL divergence 5.5525, 50 iterations in 1.4801 sec
Iteration  150, KL divergence 5.5307, 50 iterations in 1.4489 sec
Iteration  200, KL divergence 5.5215, 50 iterations in 1.4697 sec
Iteration  250, KL divergence 5.5154, 50 iterations in 1.4300 sec
   --> Time elapsed: 7.25 seconds
===> Running optimization with exaggeration=1.00, lr=45000.00 for 500 iterations...
Iteration   50, KL divergence 3.2404, 50 iterations in 1.8101 sec
Iteration  100, KL divergence 2.9987, 50 iterations in 3.2319 sec
Iteration  150, KL divergence 2.8767, 50 iterations in 4.7148 sec
Iteration  200, KL divergence 2.7982, 50 iterations in 5.8753 sec
Iteration  250, KL divergence 2.7418, 50 iterations in 6.8423 sec
Iteration  300, KL divergence 2.6983, 50 iterations in 8.1364 sec
Iteration  350, KL divergence 2.6637, 50 iterations in 9.0462 sec
Iteration  400, KL divergence 2.6351, 50 iterations in 10.1843 sec
Iteration  450, KL divergence 2.6107, 50 iterations in 10.9289 sec
Iteration  500, KL divergence 2.5903, 50 iterations in 11.7919 sec
   --> Time elapsed: 72.57 seconds
In [ ]:
utils.plot(embedding_train_default, y_train)
In [ ]:
embedding_test_default = embedding_train_default.transform(x_test)
utils.plot(embedding_test_default, y_test)
===> Finding 15 nearest neighbors in existing embedding using Annoy approximate search...
   --> Time elapsed: 1.94 seconds
===> Calculating affinity matrix...
   --> Time elapsed: 0.12 seconds
===> Running optimization with exaggeration=4.00, lr=0.10 for 0 iterations...
   --> Time elapsed: 0.00 seconds
===> Running optimization with exaggeration=1.50, lr=0.10 for 250 iterations...
Iteration   50, KL divergence 216701.8935, 50 iterations in 0.2660 sec
Iteration  100, KL divergence 214971.9142, 50 iterations in 0.2700 sec
Iteration  150, KL divergence 213923.0646, 50 iterations in 0.2721 sec
Iteration  200, KL divergence 213234.8219, 50 iterations in 0.2690 sec
Iteration  250, KL divergence 212728.6054, 50 iterations in 0.2780 sec
   --> Time elapsed: 1.36 seconds

We can plot both combined:

In [ ]:
fig, ax = plt.subplots(figsize=(8, 8))
utils.plot(embedding_train_default, y_train, alpha=0.25, ax=ax)
utils.plot(embedding_test_default, y_test, alpha=0.75, ax=ax)
ax.set_title('perplexity = 30, early_exaggeration = 12')
Out[ ]:
Text(0.5, 1.0, 'perplexity = 30, early_exaggeration = 12')

- Run the model without early exaggeration (early_exaggeration=1). What differences do you observe and what causes them?¶

In [ ]:
tsne2 = TSNE(perplexity = 30, metric = 'euclidean', early_exaggeration = 1,  random_state = 42, n_jobs = 8, verbose = True, initialization = 'pca')

embedding_train_default2 = tsne2.fit(x_train)
embedding_test_default2 = embedding_train_default2.transform(x_test)
--------------------------------------------------------------------------------
TSNE(early_exaggeration=1, n_jobs=8, random_state=42, verbose=True)
--------------------------------------------------------------------------------
===> Finding 90 nearest neighbors using Annoy approximate search using euclidean distance...
   --> Time elapsed: 16.08 seconds
===> Calculating affinity matrix...
   --> Time elapsed: 0.47 seconds
===> Calculating PCA-based initialization...
   --> Time elapsed: 0.63 seconds
===> Running optimization with exaggeration=1.00, lr=45000.00 for 250 iterations...
Iteration   50, KL divergence 3.4339, 50 iterations in 1.4179 sec
Iteration  100, KL divergence 3.1556, 50 iterations in 2.5063 sec
Iteration  150, KL divergence 3.0148, 50 iterations in 3.6378 sec
Iteration  200, KL divergence 2.9265, 50 iterations in 4.9491 sec
Iteration  250, KL divergence 2.8638, 50 iterations in 5.9922 sec
   --> Time elapsed: 18.50 seconds
===> Running optimization with exaggeration=1.00, lr=45000.00 for 500 iterations...
Iteration   50, KL divergence 2.8159, 50 iterations in 7.2589 sec
Iteration  100, KL divergence 2.7786, 50 iterations in 7.8043 sec
Iteration  150, KL divergence 2.7476, 50 iterations in 9.1626 sec
Iteration  200, KL divergence 2.7220, 50 iterations in 10.5171 sec
Iteration  250, KL divergence 2.7004, 50 iterations in 11.3700 sec
Iteration  300, KL divergence 2.6817, 50 iterations in 12.9160 sec
Iteration  350, KL divergence 2.6646, 50 iterations in 13.2518 sec
Iteration  400, KL divergence 2.6497, 50 iterations in 14.3092 sec
Iteration  450, KL divergence 2.6365, 50 iterations in 16.1026 sec
Iteration  500, KL divergence 2.6243, 50 iterations in 16.4693 sec
   --> Time elapsed: 119.16 seconds
===> Finding 15 nearest neighbors in existing embedding using Annoy approximate search...
   --> Time elapsed: 1.91 seconds
===> Calculating affinity matrix...
   --> Time elapsed: 0.02 seconds
===> Running optimization with exaggeration=4.00, lr=0.10 for 0 iterations...
   --> Time elapsed: 0.00 seconds
===> Running optimization with exaggeration=1.50, lr=0.10 for 250 iterations...
Iteration   50, KL divergence 217642.0613, 50 iterations in 0.2180 sec
Iteration  100, KL divergence 215870.9300, 50 iterations in 0.2250 sec
Iteration  150, KL divergence 214806.1101, 50 iterations in 0.2249 sec
Iteration  200, KL divergence 214103.7596, 50 iterations in 0.2266 sec
Iteration  250, KL divergence 213600.9587, 50 iterations in 0.2223 sec
   --> Time elapsed: 1.12 seconds
In [ ]:
fig, ax = plt.subplots(figsize=(8, 8))
utils.plot(embedding_train_default, y_train, alpha=0.25, ax=ax)
utils.plot(embedding_test_default, y_test, alpha=0.75, ax=ax)
ax.set_title('perplexity = 30, early_exaggeration = 12')
Out[ ]:
Text(0.5, 1.0, 'perplexity = 30, early_exaggeration = 12')
In [ ]:
fig, ax = plt.subplots(figsize=(8, 8))
utils.plot(embedding_train_default2, y_train, alpha=0.25, ax=ax)
utils.plot(embedding_test_default2, y_test, alpha=0.75, ax=ax)
ax.set_title('perplexity = 30, early_exaggeration = 1')
Out[ ]:
Text(0.5, 1.0, 'perplexity = 30, early_exaggeration = 1')
In [ ]:
import pandas as pd

kl_train_1 = embedding_train_default.kl_divergence
kl_test_1 = embedding_test_default.kl_divergence

kl_train_2 = embedding_train_default2.kl_divergence
kl_test_2 = embedding_test_default2.kl_divergence

df_kl = pd.DataFrame({
    'KL (train)': [kl_train_1, kl_train_2],
    'KL (test)': [kl_test_1, kl_test_2]
}, index=['Embedding 1', 'Embedding 2'])

df_kl
Out[ ]:
KL (train) KL (test)
Embedding 1 2.589916 206638.286184
Embedding 2 2.624046 207510.810251

The early_exaggeration factor is typically applied during the initial phase of the optimization. Essentially, it amplifies the attractive forces between points and lets them move more freely, so each point finds its nearest neighbors more easily.

Increasing this value tends to produce more clearly separated clusters. It is somewhat hard to see visually, but there are some signs of this behavior: with early_exaggeration = 1, for instance, class 2 is split into several fragments rather than forming a single cluster.
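The mechanism can be sketched numerically: in the t-SNE gradient, the attractive term acting on a point is proportional to the affinities p_ij, so multiplying every p_ij by an exaggeration factor scales the attraction by that same factor. A minimal toy illustration with made-up affinities (not openTSNE's actual implementation):

```python
import numpy as np

def attractive_force(p, y, i):
    """Attractive part of the t-SNE gradient on point i:
    sum_j p_ij * (y_i - y_j) / (1 + ||y_i - y_j||^2)."""
    diff = y[i] - y                             # (n, 2) differences to all points
    w = 1.0 / (1.0 + np.sum(diff**2, axis=1))   # Student-t kernel weights
    return np.sum((p[i] * w)[:, None] * diff, axis=0)

rng = np.random.default_rng(42)
n = 10
y = rng.normal(size=(n, 2))        # toy 2-D embedding
p = rng.random((n, n))
np.fill_diagonal(p, 0)
p /= p.sum()                       # normalized joint affinities

f1 = attractive_force(p, y, 0)     # exaggeration = 1
f12 = attractive_force(12 * p, y, 0)   # exaggeration = 12

# The attraction scales linearly with the exaggeration factor
print(np.allclose(f12, 12 * f1))   # True
```

Since the repulsive term does not depend on p_ij, the exaggerated phase is dominated by attraction, which is what pulls natural clusters together early on.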

- Run the model with the default values but changing the initialization to random. What happens? Do we get better or worse results than before? Also compare the execution times and explain why they differ.¶

In [ ]:
tsne3 = TSNE(perplexity = 30, metric = 'euclidean', early_exaggeration = 12,  random_state = 42, n_jobs = 8, verbose = True, initialization = 'random')

embedding_train_default3 = tsne3.fit(x_train)
embedding_test_default3 = embedding_train_default3.transform(x_test)
--------------------------------------------------------------------------------
TSNE(early_exaggeration=12, initialization='random', n_jobs=8, random_state=42,
     verbose=True)
--------------------------------------------------------------------------------
===> Finding 90 nearest neighbors using Annoy approximate search using euclidean distance...
   --> Time elapsed: 15.91 seconds
===> Calculating affinity matrix...
   --> Time elapsed: 0.43 seconds
===> Running optimization with exaggeration=12.00, lr=3750.00 for 250 iterations...
Iteration   50, KL divergence 7.0629, 50 iterations in 1.3290 sec
Iteration  100, KL divergence 5.6295, 50 iterations in 1.3935 sec
Iteration  150, KL divergence 5.5525, 50 iterations in 1.3452 sec
Iteration  200, KL divergence 5.5316, 50 iterations in 1.3288 sec
Iteration  250, KL divergence 5.5247, 50 iterations in 1.4411 sec
   --> Time elapsed: 6.84 seconds
===> Running optimization with exaggeration=1.00, lr=45000.00 for 500 iterations...
Iteration   50, KL divergence 3.2632, 50 iterations in 1.5996 sec
Iteration  100, KL divergence 3.0169, 50 iterations in 2.8510 sec
Iteration  150, KL divergence 2.8940, 50 iterations in 4.6382 sec
Iteration  200, KL divergence 2.8147, 50 iterations in 5.5003 sec
Iteration  250, KL divergence 2.7574, 50 iterations in 6.6277 sec
Iteration  300, KL divergence 2.7138, 50 iterations in 7.7444 sec
Iteration  350, KL divergence 2.6785, 50 iterations in 9.1228 sec
Iteration  400, KL divergence 2.6497, 50 iterations in 10.2350 sec
Iteration  450, KL divergence 2.6252, 50 iterations in 11.1869 sec
Iteration  500, KL divergence 2.6048, 50 iterations in 12.8381 sec
   --> Time elapsed: 72.35 seconds
===> Finding 15 nearest neighbors in existing embedding using Annoy approximate search...
   --> Time elapsed: 1.90 seconds
===> Calculating affinity matrix...
   --> Time elapsed: 0.02 seconds
===> Running optimization with exaggeration=4.00, lr=0.10 for 0 iterations...
   --> Time elapsed: 0.00 seconds
===> Running optimization with exaggeration=1.50, lr=0.10 for 250 iterations...
Iteration   50, KL divergence 216671.2177, 50 iterations in 0.2300 sec
Iteration  100, KL divergence 214961.4632, 50 iterations in 0.2420 sec
Iteration  150, KL divergence 213965.0271, 50 iterations in 0.2330 sec
Iteration  200, KL divergence 213305.9178, 50 iterations in 0.2178 sec
Iteration  250, KL divergence 212823.7289, 50 iterations in 0.2146 sec
   --> Time elapsed: 1.14 seconds
In [ ]:
fig, ax = plt.subplots(figsize=(8, 8))
utils.plot(embedding_train_default, y_train, alpha=0.25, ax=ax)
utils.plot(embedding_test_default, y_test, alpha=0.75, ax=ax)
ax.set_title('perplexity = 30, early_exaggeration = 12, initialization = pca')
Out[ ]:
Text(0.5, 1.0, 'perplexity = 30, early_exaggeration = 12, initialization = pca')
In [ ]:
fig, ax = plt.subplots(figsize=(8, 8))
utils.plot(embedding_train_default3, y_train, alpha=0.25, ax=ax)
utils.plot(embedding_test_default3, y_test, alpha=0.75, ax=ax)
ax.set_title('perplexity = 30, early_exaggeration = 12, initialization = random')
Out[ ]:
Text(0.5, 1.0, 'perplexity = 30, early_exaggeration = 12, initialization = random')
In [ ]:
import pandas as pd

kl_train_1 = embedding_train_default.kl_divergence
kl_test_1 = embedding_test_default.kl_divergence

kl_train_3 = embedding_train_default3.kl_divergence
kl_test_3 = embedding_test_default3.kl_divergence

df_kl = pd.DataFrame({
    'KL (train)': [kl_train_1, kl_train_3],
    'KL (test)': [kl_test_1, kl_test_3]
}, index=['Embedding 1', 'Embedding 3'])

df_kl
Out[ ]:
KL (train) KL (test)
Embedding 1 2.589916 206638.286184
Embedding 3 2.604412 206733.721094

In terms of Kullback-Leibler divergence the results are quite similar, with PCA initialization obtaining a slightly lower (better) value. This is far from conclusive: the randomness involved is strong, and merely changing the random_state would change our results.

The run with random initialization takes slightly longer, since it starts from an arbitrary scatter of points in the embedding space. This leads to a different path to convergence compared with PCA initialization, where the initial points are already arranged according to the structure of the input data.
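As a rough sketch of what PCA initialization does, the data can be centered and projected onto its top two principal components with plain NumPy (a simplification; openTSNE additionally rescales the initial coordinates):

```python
import numpy as np

def pca_init(X, n_components=2):
    """Project centered data onto its top principal components via SVD."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T   # scores along the leading components

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))        # toy data: 100 samples, 50 features
Y0 = pca_init(X)
print(Y0.shape)                       # (100, 2)
```

Starting from these coordinates, nearby points in the data already start nearby in the embedding, which is why the optimizer typically needs less work than from a random scatter.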

- Run the model with two very different perplexity values, e.g. 1 and 100 (all other values at their defaults), and comment on the results.¶

In [ ]:
tsne4 = TSNE(perplexity = 1, metric = 'euclidean', early_exaggeration = 12,  random_state = 42, n_jobs = 8, verbose = True, 
             initialization = 'pca')

tsne5 = TSNE(perplexity = 100, metric = 'euclidean', early_exaggeration = 12,  random_state = 42, n_jobs = 8, verbose = True, 
             initialization = 'pca')
In [ ]:
embedding_train_default4 = tsne4.fit(x_train)
embedding_test_default4 = embedding_train_default4.transform(x_test)
--------------------------------------------------------------------------------
TSNE(early_exaggeration=12, n_jobs=8, perplexity=1, random_state=42,
     verbose=True)
--------------------------------------------------------------------------------
===> Finding 3 nearest neighbors using Annoy approximate search using euclidean distance...
   --> Time elapsed: 8.20 seconds
===> Calculating affinity matrix...
   --> Time elapsed: 0.02 seconds
===> Calculating PCA-based initialization...
   --> Time elapsed: 0.63 seconds
===> Running optimization with exaggeration=12.00, lr=3750.00 for 250 iterations...
Iteration   50, KL divergence 8.0306, 50 iterations in 1.0812 sec
Iteration  100, KL divergence 7.2541, 50 iterations in 1.0988 sec
Iteration  150, KL divergence 6.8804, 50 iterations in 1.0940 sec
Iteration  200, KL divergence 6.6425, 50 iterations in 1.0825 sec
Iteration  250, KL divergence 6.4701, 50 iterations in 1.0576 sec
   --> Time elapsed: 5.41 seconds
===> Running optimization with exaggeration=1.00, lr=45000.00 for 500 iterations...
Iteration   50, KL divergence 5.0642, 50 iterations in 1.3456 sec
Iteration  100, KL divergence 4.5599, 50 iterations in 2.4179 sec
Iteration  150, KL divergence 4.2578, 50 iterations in 3.7085 sec
Iteration  200, KL divergence 4.0420, 50 iterations in 4.7750 sec
Iteration  250, KL divergence 3.8749, 50 iterations in 6.0586 sec
Iteration  300, KL divergence 3.7386, 50 iterations in 7.1351 sec
Iteration  350, KL divergence 3.6243, 50 iterations in 7.8254 sec
Iteration  400, KL divergence 3.5257, 50 iterations in 9.1333 sec
Iteration  450, KL divergence 3.4398, 50 iterations in 10.3246 sec
Iteration  500, KL divergence 3.3625, 50 iterations in 11.2153 sec
   --> Time elapsed: 63.94 seconds
===> Finding 15 nearest neighbors in existing embedding using Annoy approximate search...
   --> Time elapsed: 1.89 seconds
===> Calculating affinity matrix...
   --> Time elapsed: 0.02 seconds
===> Running optimization with exaggeration=4.00, lr=0.10 for 0 iterations...
   --> Time elapsed: 0.00 seconds
===> Running optimization with exaggeration=1.50, lr=0.10 for 250 iterations...
Iteration   50, KL divergence 230280.3691, 50 iterations in 0.2020 sec
Iteration  100, KL divergence 228329.7409, 50 iterations in 0.2182 sec
Iteration  150, KL divergence 226994.9016, 50 iterations in 0.2317 sec
Iteration  200, KL divergence 226028.1811, 50 iterations in 0.2295 sec
Iteration  250, KL divergence 225261.7174, 50 iterations in 0.2176 sec
   --> Time elapsed: 1.10 seconds
In [ ]:
embedding_train_default5 = tsne5.fit(x_train)
embedding_test_default5 = embedding_train_default5.transform(x_test)
--------------------------------------------------------------------------------
TSNE(early_exaggeration=12, n_jobs=8, perplexity=100, random_state=42,
     verbose=True)
--------------------------------------------------------------------------------
===> Finding 300 nearest neighbors using Annoy approximate search using euclidean distance...
   --> Time elapsed: 26.37 seconds
===> Calculating affinity matrix...
   --> Time elapsed: 1.55 seconds
===> Calculating PCA-based initialization...
   --> Time elapsed: 0.63 seconds
===> Running optimization with exaggeration=12.00, lr=3750.00 for 250 iterations...
Iteration   50, KL divergence 4.9143, 50 iterations in 1.9509 sec
Iteration  100, KL divergence 4.9994, 50 iterations in 2.0278 sec
Iteration  150, KL divergence 5.0014, 50 iterations in 1.9024 sec
Iteration  200, KL divergence 5.0013, 50 iterations in 1.9296 sec
Iteration  250, KL divergence 5.0013, 50 iterations in 1.8109 sec
   --> Time elapsed: 9.62 seconds
===> Running optimization with exaggeration=1.00, lr=45000.00 for 500 iterations...
Iteration   50, KL divergence 2.5421, 50 iterations in 1.9706 sec
Iteration  100, KL divergence 2.3885, 50 iterations in 2.9697 sec
Iteration  150, KL divergence 2.3220, 50 iterations in 4.0773 sec
Iteration  200, KL divergence 2.2834, 50 iterations in 5.1269 sec
Iteration  250, KL divergence 2.2561, 50 iterations in 5.6893 sec
Iteration  300, KL divergence 2.2365, 50 iterations in 6.6018 sec
Iteration  350, KL divergence 2.2200, 50 iterations in 7.2386 sec
Iteration  400, KL divergence 2.2073, 50 iterations in 8.1607 sec
Iteration  450, KL divergence 2.1962, 50 iterations in 8.5898 sec
Iteration  500, KL divergence 2.1879, 50 iterations in 9.1279 sec
   --> Time elapsed: 59.55 seconds
===> Finding 15 nearest neighbors in existing embedding using Annoy approximate search...
   --> Time elapsed: 2.14 seconds
===> Calculating affinity matrix...
   --> Time elapsed: 0.02 seconds
===> Running optimization with exaggeration=4.00, lr=0.10 for 0 iterations...
   --> Time elapsed: 0.00 seconds
===> Running optimization with exaggeration=1.50, lr=0.10 for 250 iterations...
Iteration   50, KL divergence 217079.8194, 50 iterations in 0.2080 sec
Iteration  100, KL divergence 215606.6905, 50 iterations in 0.2101 sec
Iteration  150, KL divergence 214807.1805, 50 iterations in 0.1970 sec
Iteration  200, KL divergence 214280.5428, 50 iterations in 0.2190 sec
Iteration  250, KL divergence 213899.1103, 50 iterations in 0.2300 sec
   --> Time elapsed: 1.06 seconds
In [ ]:
fig, ax = plt.subplots(figsize=(8, 8))
utils.plot(embedding_train_default4, y_train, alpha=0.25, ax=ax)
utils.plot(embedding_test_default4, y_test, alpha=0.75, ax=ax)
ax.set_title('perplexity = 1, early_exaggeration = 12')
Out[ ]:
Text(0.5, 1.0, 'perplexity = 1, early_exaggeration = 12')
In [ ]:
fig, ax = plt.subplots(figsize=(8, 8))
utils.plot(embedding_train_default5, y_train, alpha=0.25, ax=ax)
utils.plot(embedding_test_default5, y_test, alpha=0.75, ax=ax)
ax.set_title('perplexity = 100, early_exaggeration = 12')
Out[ ]:
Text(0.5, 1.0, 'perplexity = 100, early_exaggeration = 12')

When we run t-SNE with a very low perplexity such as 1 and a very high one such as 100, we obtain noticeably different results, because perplexity determines how many neighbors influence the probability distributions that t-SNE optimizes.

  • Perplexity 1: with such a low perplexity the model cannot adequately capture the global structure of the data. Perplexity controls the number of neighbors considered when computing the probability distributions; with a perplexity of 1 only the very nearest neighbors count, producing an extremely local representation. This can lead to poor groupings and a misleading picture of the data's structure in the low-dimensional space.

  • Perplexity 100: conversely, with such a high perplexity the model captures the global structure better, since more neighbors enter the probability computation. However, this can oversimplify the local structure, yielding a representation where local relationships are lost in favor of the global structure.

In short, extreme perplexity values such as 1 and 100 expose two failure modes: inability to capture the global structure, or loss of local detail. Perplexity must be tuned to balance the global and local representation of the data.
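The "number of effective neighbors" reading of perplexity can be made concrete: for a bandwidth sigma, the conditional distribution p_{j|i} is a normalized Gaussian over distances, and the perplexity is 2 raised to its Shannon entropy. A toy sketch (in real t-SNE the inverse problem is solved, searching for the sigma that matches the requested perplexity):

```python
import numpy as np

def perplexity_of(dists, sigma):
    """Perplexity 2**H of the Gaussian conditional distribution p_{j|i}."""
    p = np.exp(-dists**2 / (2 * sigma**2))
    p /= p.sum()
    H = -np.sum(p * np.log2(p + 1e-12))   # Shannon entropy in bits
    return 2.0 ** H

dists = np.linspace(0.1, 5.0, 50)   # distances from point i to 50 neighbors

small = perplexity_of(dists, 0.05)   # tiny sigma: ~1 effective neighbor
large = perplexity_of(dists, 100.0)  # huge sigma: near-uniform, ~50 neighbors
print(round(small, 2), round(large, 2))
```

A perplexity of 1 therefore means each point effectively "sees" only its single nearest neighbor, while 100 spreads its attention over a hundred of them.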

- Of all the t-SNE configurations tried in the previous exercises, pick the one with the best results and project the test data into its embedding. Plot the full dataset.¶

In [ ]:
fig, ax = plt.subplots(figsize=(8, 8))
utils.plot(embedding_train_default, y_train, alpha=0.25, ax=ax)
utils.plot(embedding_test_default, y_test, alpha=0.75, ax=ax)
ax.set_title('perplexity = 30, early_exaggeration = 12')
Out[ ]:
Text(0.5, 1.0, 'perplexity = 30, early_exaggeration = 12')

Exercises with sklearn's TSNE¶

The main idea behind t-SNE is to preserve the local and global structure of the data during dimensionality reduction. It works by computing a joint probability distribution over pairs of points in the original space, and a similar probability distribution in the low-dimensional space.

It then adjusts the points in the low-dimensional space to minimize the divergence between these two probability distributions, typically using the Kullback-Leibler divergence.
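Both ingredients fit in a few lines: Student-t similarities q_ij in the embedding and the divergence KL(P||Q) being minimized. A toy sketch with random symmetric affinities standing in for the real P:

```python
import numpy as np

def student_t_q(Y):
    """Low-dimensional similarities q_ij using a Student-t (df=1) kernel."""
    d2 = np.sum((Y[:, None] - Y[None, :])**2, axis=-1)
    w = 1.0 / (1.0 + d2)
    np.fill_diagonal(w, 0.0)      # a point is not its own neighbor
    return w / w.sum()            # normalize to a joint distribution

def kl_divergence(P, Q):
    """KL(P || Q) over the off-diagonal (nonzero) entries."""
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

rng = np.random.default_rng(0)
n = 20
P = rng.random((n, n))
P = (P + P.T) / 2                 # symmetric, like t-SNE's joint P
np.fill_diagonal(P, 0.0)
P /= P.sum()

Q = student_t_q(rng.normal(size=(n, 2)))   # toy 2-D embedding
kl = kl_divergence(P, Q)
print(kl >= 0)   # KL divergence is always non-negative
```

The optimizer moves the 2-D points to drive this quantity down, which is exactly the KL value reported in the logs above.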

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

colors = ['royalblue','red','deeppink', 'maroon', 'mediumorchid', 'tan', 'forestgreen', 'olive', 'goldenrod', 'lightcyan', 'navy']
vectorizer = np.vectorize(lambda x: colors[x % len(colors)])

from sklearn.datasets import make_circles
X, y = make_circles(n_samples=200, noise=0.01, random_state = 42)
plt.scatter(X[:,0], X[:,1],c=vectorizer(y))
Out[ ]:
<matplotlib.collections.PathCollection at 0x1b7ce0aac90>

For more information on TSNE in sklearn, see the following link:

https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html

  • perplexity: related to the number of nearest neighbors used in other manifold-learning algorithms. Larger datasets usually require a larger value.

  • early_exaggeration: controls how tightly packed the natural clusters of the original space are in the embedded space, and how much space there is between them. Larger values mean more space between the natural clusters in the embedded space.

Run sklearn's t-SNE on the circles dataset, varying the perplexity (values 5, 30, 100).¶

In [ ]:
def visualize_tsne(X, y, perplexities):
    plt.figure(figsize=(18, 6))
    num_perplexities = len(perplexities)
    
    kl_divergences = {}  # Diccionario para almacenar las divergencias KL
    
    for i, perplexity in enumerate(perplexities, start=1):
        tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)
        X_tsne = tsne.fit_transform(X)
        
        plt.subplot(1, num_perplexities, i)
        unique_classes = np.unique(y)
        for cls in unique_classes:
            indices = np.where(y == cls)
            plt.scatter(X_tsne[indices, 0], X_tsne[indices, 1], label=f'Class {cls}', alpha=0.8)
        
        plt.title(f"Perplexity = {perplexity}")
        plt.xlabel("t-SNE Component 1")
        plt.ylabel("t-SNE Component 2")
        plt.legend()
        plt.grid(True)
        
        kl_divergences[perplexity] = tsne.kl_divergence_
    
    plt.tight_layout()
    plt.show()
    
    return kl_divergences
In [ ]:
perplexities = [5, 30, 100]
kl_divergences = visualize_tsne(X, y, perplexities)
c:\Users\jesus\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\manifold\_t_sne.py:800: FutureWarning: The default initialization in TSNE will change from 'random' to 'pca' in 1.2.
  warnings.warn(
c:\Users\jesus\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\manifold\_t_sne.py:810: FutureWarning: The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.
  warnings.warn(
[figure: t-SNE embeddings of the circles dataset for perplexity = 5, 30 and 100]

We see that with a low perplexity value things do not go well: the circular structure disappears completely.

With a perplexity of 30 some regions still look distorted, while with a perplexity of 100 the two circles are recovered almost perfectly.

What KL divergence do you obtain in each case?¶

In [ ]:
print("KL Divergences:", kl_divergences)
KL Divergences: {5: 0.4045565128326416, 30: 0.19363267719745636, 100: 0.07120548188686371}

We see that increasing the perplexity yields ever smaller Kullback-Leibler divergences, indicating that the discrepancy between the probability distribution of the data in the original space and in the reduced space keeps decreasing.

Compare the execution times of Barnes-Hut with the exact method. Use the perplexity value that obtained the best result in the previous exercise.¶

In [ ]:
import time

def run_tsne(X, perplexity, method):
    start_time = time.time()
    tsne = TSNE(n_components=2, perplexity=perplexity, method=method, random_state=42)
    X_tsne = tsne.fit_transform(X)
    
    end_time = time.time()
    execution_time = end_time - start_time
    return X_tsne, execution_time

perplexity = 100

# Barnes-Hut method
X_tsne_bh, execution_time_bh = run_tsne(X, perplexity, method='barnes_hut')

# Exact method
X_tsne_exact, execution_time_exact = run_tsne(X, perplexity, method='exact')

print(f"Tiempo de ejecución usando Barnes-Hut: {execution_time_bh} segundos")
print(f"Tiempo de ejecución usando método exacto: {execution_time_exact} segundos")
c:\Users\jesus\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\manifold\_t_sne.py:800: FutureWarning: The default initialization in TSNE will change from 'random' to 'pca' in 1.2.
  warnings.warn(
c:\Users\jesus\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\manifold\_t_sne.py:810: FutureWarning: The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.
  warnings.warn(
Tiempo de ejecución usando Barnes-Hut: 0.28653502464294434 segundos
Tiempo de ejecución usando método exacto: 0.4057042598724365 segundos

We see that the Barnes-Hut method needs somewhat less computation time. On a dataset this small the difference is modest; the advantage grows with the number of samples, since Barnes-Hut reduces the cost of the gradient computation from O(n²) to O(n log n).

The main idea behind Barnes-Hut is that, instead of computing the interaction of each point with every other point, the influence of a group of distant points is approximated by their centroid. In t-SNE this approximation is applied to the repulsive forces in the low-dimensional embedding space, which is recursively divided into "cells" (a quadtree), so that instead of computing every pairwise term, one term per sufficiently distant cell suffices.
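To illustrate the centroid idea outside of t-SNE, here is a small self-contained sketch (entirely our own, not openTSNE or sklearn code): it compares an exact sum of 1/distance interactions against a grid-based approximation in which distant cells are summarized by their centroid, mimicking the Barnes-Hut opening criterion:

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.uniform(0, 1, (500, 2))
query = np.array([0.5, 0.5])

# Exact: sum the 1/distance interaction of the query with every point.
d = np.linalg.norm(pts - query, axis=1)
d = d[d > 1e-9]
exact = np.sum(1.0 / d)

# Barnes-Hut-style approximation: bucket points into a grid; distant
# cells are summarized by their centroid (weighted by point count),
# nearby cells are computed point by point (the "opening" criterion).
grid = 8
cell_size = 1.0 / grid
cells = {}
for p in pts:
    key = (min(int(p[0] * grid), grid - 1), min(int(p[1] * grid), grid - 1))
    cells.setdefault(key, []).append(p)

approx = 0.0
for members in cells.values():
    members = np.array(members)
    centroid = members.mean(axis=0)
    dist = np.linalg.norm(centroid - query)
    if dist > 2 * cell_size:          # distant cell: one centroid term
        approx += len(members) / dist
    else:                             # nearby cell: exact per-point terms
        dm = np.linalg.norm(members - query, axis=1)
        approx += np.sum(1.0 / dm[dm > 1e-9])

rel_err = abs(approx - exact) / exact
print(f"exact={exact:.1f} approx={approx:.1f} rel_err={rel_err:.3f}")
```

The approximation evaluates one term per distant cell instead of one per point, which is where the O(n log n) cost of the real (tree-based) algorithm comes from.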